[repo depot 3/n] nexus background task to replicate TUF artifacts across sleds #7129
Conversation
Update: I got enough plumbing done to get an integration test working, so I'll have that here soon.
(force-pushed from ba36c79 to b2fd749)
pub(super) struct SimArtifactStorage {
    root: Utf8TempDir,
    backend: Arc<Mutex<Storage>>,
    dirs: Arc<(Utf8TempDir, Utf8TempDir)>,
Could we come up with a named type for this, or at least a comment? It's hard to tell what this tuple is for.
(Are there two of these to emulate having two M.2s?)
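For illustration, a minimal sketch of the named-type idea, assuming the two directories do stand in for the two M.2 devices (the type and field names here are invented, not from the PR):

use camino_tempfile::Utf8TempDir;

/// Hypothetical named type: one artifact directory per simulated M.2
/// device, replacing the anonymous (Utf8TempDir, Utf8TempDir) tuple.
struct SimArtifactDirs {
    m2_a: Utf8TempDir,
    m2_b: Utf8TempDir,
}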
ALTER TABLE omicron.public.sled
    ADD COLUMN IF NOT EXISTS repo_depot_port INT4
        CHECK (port BETWEEN 0 AND 65535)
        NOT NULL DEFAULT 0;
We definitely do want the default here -- it's necessary to perform this upgrade -- but on existing systems, would we ever patch this value? Are we relying on the sled_upsert to update this value?
We are relying on sled_upsert, yeah. I am not actually certain whether that runs every time Sled Agent starts; I did not look that closely, but it seems that there's a background task that periodically notifies Nexus of Sled Agent's existence?
So if anyone tries using this column before that value propagates, they'll try communicating with port zero on the sled agent, right?
Should we set this value to 12348 instead of zero? It's a constant anyway; that seems like it would avoid this period of seeing repo_depot_port = 0.
Is 12348 the actual port in use?
If we don't want to assume that, then NULL is essentially a legal state, and it seems like we should use NULL here and have Nexus generate an error if it runs into a case where it needs this.
12348 is the port in use; it's a const:
Lines 31 to 32 in dcc0df3
pub const SLED_AGENT_PORT: u16 = 12345;
pub const REPO_DEPOT_PORT: u16 = 12348;
As noted elsewhere, the main reason this is in the database is to support tests, where the const is not used. So I think a 12348 default for current sleds is relatively safe. (Making this nullable instead would, I think, be the right thing if we were rolling into this upgrade instead of doing a mupdate.)
To be clear: the assumption here is that any system on which the migration is running is a "real" system and for those the port has always been 12348 and so filling it in this way is definitely correct. That makes sense to me. Add a comment to that effect?
# The default activation period for this task is 60s, but we want to activate it
# manually to test the result of each activation.
tuf_artifact_replication.period_secs = 600
I recommend adding a disable config option to just explicitly turn this off. Our tests don't run for 600 seconds, until sometimes they do; disable = true will just stop it for good.
Is the comment trying to say that the config is specifying a value of 600s (which is greater than 60s) so that it will only be activated manually? I think that could be more explicit.
disable is a nice pattern. To be clear, though, I believe you'd need to implement the disable flag yourself (e.g., the way the instance_updater task does). I don't think it's a thing you can just set in the config for any background task and have the driver skip it for you.
But it won't work here, at least not if you implement it so that disable means "skip all activations of this task" (which is what the other implementors of disable do). That would prevent manual activations from doing anything, too.
If it's just for tests, I think it's reasonable to set this to some astronomical number (much bigger than 10m), but it would be even better if the test could be resilient to extra activations (see my other comment).
I was thinking, based on Sean's comment, that I could disable periodic activation without disabling manual activation, but that doesn't seem to be the case. If background tasks were passed the ActivationReason, it'd be much easier to disable periodic activation without disabling manual activation.
It looks like our "terminate slow tests" timeout in CI is 15 minutes, so setting it to 3600s would be reasonable for tests, I think.
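For illustration, a rough sketch of what a task could do if the driver did pass an ActivationReason; everything here is hypothetical, since no such parameter exists in the driver today:

/// Hypothetical: why the driver activated a background task.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ActivationReason {
    Periodic,
    Manual,
}

struct TufArtifactReplication {
    /// Hypothetical config knob: skip timer-driven activations only.
    disable_periodic: bool,
}

impl TufArtifactReplication {
    fn should_run(&self, reason: ActivationReason) -> bool {
        // Manual activations (e.g., from tests) always run; periodic
        // activations are skipped when disabled.
        !(self.disable_periodic && reason == ActivationReason::Periodic)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn manual_activation_still_runs() {
        let task = TufArtifactReplication { disable_periodic: true };
        assert!(!task.should_run(ActivationReason::Periodic));
        assert!(task.should_run(ActivationReason::Manual));
    }
}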
//! 1. The task moves `ArtifactsWithPlan` objects of the `mpsc` channel and into
//!    a `Vec`.
Suggested change:

//! 1. The task moves `ArtifactsWithPlan` objects off the `mpsc` channel and into
//!    a `Vec`.
?
//! 5. The task generates a list of requests that need to be sent:
//!    - PUT each locally-stored artifact not present on any sleds to a random
//!      sled.
//!    - For each partially-replicated artifact, choose a sled that is missing
//!      the artifact, and tell it (via `artifact_copy_from_depot`) to fetch the
//!      artifact from a random sled that has it.
//!    - DELETE all artifacts no longer tracked in CockroachDB from all sleds
//!      that have that artifact.
What are your thoughts on the following scenario:
- Nexus A runs, is really slow. Tries to tell all sleds to use "version N" of artifacts, and delete everything else.
- Slightly later, Nexus B runs, and is much faster. Tries to tell all sleds to use "version N + 1" of artifacts, and delete everything else.
- Sled Agents see a mix of requests from Nexus A and B, and end up with a mix of "version N" and "version N+1" artifacts.
We usually use generation numbers when sending requests to sled agents to prevent us from moving backwards. Do we have something similar here? Should we?
Where would we store the generation number, and who would verify it?
Presumably we'd store it in CockroachDB somewhere, but I think we want a single generation number for the complete list of all artifacts, so we can't store it on each tuf_repo row.
Would sleds store the highest generation number they've seen, and reject any requests with a lower number?
We've used this pattern for storing config information about zones, datasets, and physical disks.
- Nexus stores the generation number in cockroachdb, so that it can be agreed upon between multiple instances.
- It gets sent to each sled agent as part of the request, and is stored on the M.2
- Sled agents ignore requests with generation numbers lower than their last-seen value
- If the generation number is equal to their last seen value, they can confirm that the contents are the same, and otherwise throw an error
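A minimal sketch of the sled-agent side of that pattern, using invented types (the real zone/dataset/disk config plumbing is more involved):

use std::cmp::Ordering;

/// Hypothetical artifact config sent from Nexus to a sled agent.
#[derive(PartialEq)]
struct ArtifactConfig {
    generation: u64,
    artifact_hashes: Vec<String>,
}

enum ConfigError {
    /// Incoming generation is older than the last one we accepted.
    Stale,
    /// Same generation but different contents: someone misused the number.
    Mismatch,
}

fn apply_config(
    last_seen: &mut ArtifactConfig,
    incoming: ArtifactConfig,
) -> Result<(), ConfigError> {
    match incoming.generation.cmp(&last_seen.generation) {
        // Ignore requests with generation numbers lower than our last-seen.
        Ordering::Less => Err(ConfigError::Stale),
        // Equal generations must carry identical contents.
        Ordering::Equal if incoming != *last_seen => Err(ConfigError::Mismatch),
        Ordering::Equal => Ok(()),
        // A newer generation replaces our last-seen config (and would be
        // persisted to the M.2 in the real pattern).
        Ordering::Greater => {
            *last_seen = incoming;
            Ok(())
        }
    }
}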
If I'm understanding the big picture right: this task isn't telling sleds to use artifacts from one repo or the other. It's just copying files around. I expect those files aren't used for anything today. A follow-on step will be to have Sled Agent use these artifacts when creating control plane zones. But when it does that, the request from Reconfigurator will say exactly which artifact to use. At no point does Sled Agent care which TUF repo the artifacts are associated with. And it's not limited to having only one artifact for each kind of zone or anything like that.
If all that's correct, then I think the essential constraints are:
- Once a TUF repo is uploaded, then as long as it remains present in CockroachDB, all sleds should converge to having all of the artifacts in that repo.
- Once a sled has any artifact from a TUF repo that's present in CockroachDB, it should not remove that artifact until after that repo has been removed from CockroachDB. This is just saying that we never go "backwards".
- Once an artifact is no longer referenced by any TUF repos present in CockroachDB, then all sleds should converge to not having it.
- This is less critical but: once a sled has removed an artifact that's not referenced by any TUF repos present in CockroachDB, then it should not download it again (unless some other TUF repo is uploaded that references it).
Importantly, for any two TUF repos N and N+1, whether they get uploaded around the same time or not, as long as both remain present in CockroachDB, it seems perfectly fine for a sled to have literally any combination of artifacts from both repos (including none of N and all of N+1, all of N and none of N+1, etc.) -- as long as Nexus is making sure that all sleds eventually get all of them.
Does all that seem right @iliana?
We could model this instead as: there's a set of artifacts that all sleds should have, that has a generation number, and that number gets bumped whenever a TUF repo gets added or removed. But I don't think that's necessary or helpful here.
That is 100% correct; however, Sean's concern is also valid in that two Nexuses could be running this background task simultaneously without conferring with each other. Because they only fetch the list of artifacts from the database once, concurrently executing tasks can have different views of what the correct list of artifacts is.
The way Sean described how a problem could manifest here is, I think, somewhat incorrect, but I think this problem could indeed occur:
- Nexus A runs, fetches the list of artifacts at time N, and then goes off to ask all the sleds what they've got. The first sled's response is slowed by congestion/cosmic rays/gremlins/whatever.
- Nexus B runs because it just got a new repo from an operator, fetches the list of artifacts at time N+1, and starts replicating its local artifacts to other sleds. This all happens before Nexus A gets its first response.
- Nexus A gets its responses, sees all these sleds have artifacts it doesn't know about, and tells all the sleds to delete them.
The risk here is that the rack loses all of the copies of an artifact. But Nexus B won't delete its local copy of the repository artifacts until the next time it runs: local repos are only dropped if the information we've collected from all the sleds indicates sufficient replication of all the local artifacts in that repository. By that time it will see that all the sleds no longer have a copy and it'll try to replicate them again.
It's also not very likely that this particular problem will occur except in a very contrived case, because each task execution is limited to performing 32 API calls. But if we ever shipped a patch repo that changed only a single artifact, it becomes much less contrived. (Or if we bumped those consts up, because the numbers aren't really based on anything.)
A generation number would add some defense in depth here, but I am not quite sure whether it's helpful or not.
// or three on the system during an upgrade (we've sized the artifact
// datasets to fit at most 10 repositories for this reason).
Can you give more context to the claim "We've sized the artifact datasets to fit at most 10 repositories"?
Are you saying the M.2 'update' dataset is 10x the size of "a repo"? but which repo?
omicron/sled-storage/src/dataset.rs
Lines 50 to 53 in dcc0df3
// TODO-correctness: This value of 20 GiB is a wild guess -- given TUF repo
// sizes as of Oct 2024, it would be capable of storing about 10 distinct system
// versions.
pub const ARTIFACT_DATASET_QUOTA: ByteCount = ByteCount::from_gibibytes_u32(20);
All of these numbers are dubious.
"failed to send artifacts for replication: {err}" | ||
)) | ||
})?; | ||
// Now store the artifacts in the database. |
If we send artifacts_with_plan, write to the database, and crash before the background task completes, what happens to the state durably recorded in the database? Wouldn't we have a tuf_repo recorded in a database table that doesn't actually exist on the rack?
We would. There's a // TODO in the background task to mark individual repositories or artifacts as "failed" if any of the artifacts are verifiably missing, but this is not yet implemented.
However, given the comment below, I'm going to flip this logic. We still need to implement some way of noticing missing artifacts, though.
// If the last sender for this channel is dropped, then the
// `Nexus` type has also been dropped. This is presumably
// a bug.
panic!("artifact replication receiver disconnected");
Nit: I know it's unlikely, but can we return an error here instead? This behavior makes implementing clean-shutdown really painful
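Something like this sketch, perhaps, assuming the task reads from a tokio mpsc receiver (the error type here is only illustrative):

use tokio::sync::mpsc;

/// Stand-in for `ArtifactsWithPlan`.
struct Artifacts;

/// Returns an error instead of panicking when every sender (held by the
/// `Nexus` type) has been dropped, so shutdown can proceed cleanly.
async fn recv_artifacts(
    rx: &mut mpsc::Receiver<Artifacts>,
) -> Result<Artifacts, &'static str> {
    // `recv` yields `None` once all senders are dropped.
    rx.recv().await.ok_or("artifact replication channel closed")
}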
// `Utf8TempDir`s storing the artifacts, into the artifact replication
// background task. This is done before the database insert because if
// this fails, we shouldn't record the artifacts in the database.
self.tuf_artifact_replication_tx
We're sending the artifacts on this channel before we write them into the DB. If the background task activated after we write to this mpsc, but before we call update_tuf_repo_insert, would it discard the TUF artifacts we just tried to add?
Hmm. I think it would. This seems clearly worse than "we recorded the artifacts in the database, but failed to move the artifacts into the mpsc", right? Particularly since failing to move the artifacts shouldn't ever really happen unless somehow the background tasks are dead.
Probably I will flip these around, which also gives me the opportunity to avoid sending the artifacts and activating the task if the repo is either already uploaded or was rejected for whatever reason.
if new_status == status {
    panic!("TUF artifact replication stalled: {new_status:?}");
}
This background task involves replicating to a random selection of sleds. If we pick the same set twice, would we stall, and trigger this panic?
I don't think it's possible to pick the same set twice, unless errors are occurring, and we panic within these test functions if any errors are recorded by the task.
@@ -375,6 +375,7 @@ mod test {
    let sled = SledUpdate::new(
        *sled_id.as_untyped_uuid(),
        "[::1]:0".parse().unwrap(),
        0,
There are quite a lot of places where we use 0 for the repo depot port. I assume this is a sentinel value? It might be nice to use Option instead here. (See also the discussion about whether the field should be NULLable, but I think this applies regardless.)
In general in these test functions, I tried to match how the sled agent port was already being specified. In this case you can see the sled agent SocketAddr is localhost port 0.
Yeah, that makes sense.
The end result is that there are many callers where we're repeating the same values that, if I'm understanding right, can't actually be right -- they're just unused. This makes me wonder if both of those ought to be optional. Maybe this should be a SledUpdateBuilder? But anyway, it's fine to say that's out of scope here.
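A rough sketch of that builder idea, with invented names, assuming both ports become optional for test callers (the real SledUpdate takes more fields than this):

/// Hypothetical builder so test callers stop repeating meaningless
/// port values at every call site.
struct SledUpdateBuilder {
    sled_agent_port: Option<u16>,
    repo_depot_port: Option<u16>,
}

impl SledUpdateBuilder {
    fn new() -> Self {
        Self { sled_agent_port: None, repo_depot_port: None }
    }

    fn repo_depot_port(mut self, port: u16) -> Self {
        self.repo_depot_port = Some(port);
        self
    }

    fn build(self) -> (u16, u16) {
        // Tests that don't care get an arbitrary (unused) port rather
        // than spelling out a sentinel 0 everywhere.
        (self.sled_agent_port.unwrap_or(0), self.repo_depot_port.unwrap_or(0))
    }
}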
/// Number of requests that were not sent during this activation because the
/// limits had been reached.
pub requests_outstanding: usize,
I would read "outstanding" to mean "in-flight" rather than "not sent". Is it "outstanding" in the sense of "these requests remain to be done and will be done on the next activation"? How about calling this "requests_deferred"? (Or "requests_wanted", "requests_skipped", "requests_overflow", or "requests_over_limit".)
/// The number of repositories this Nexus instance is keeping in local
/// storage. Local repositories are deleted once all their artifacts are
/// sufficiently replicated.
pub local_repos: usize,
Even nittier, and feel free to ignore: what do you think of unreplicated_repos? Honestly I'm torn, because I prefer the fact (the count of local repos) over the inference (that they're under-replicated), but I also feel like local_repos doesn't communicate its importance if you don't already know how this system works.
-- TODO: Repos fetched over HTTP will not have a SHA256 hash; this is an
-- implementation detail of our ZIP archives.
nit: I'd strike this -- we'll have to deal with it if/when we add HTTP support but I'm not sure the comment helps.
I added it because it will certainly help me in the future :) It's more of a reminder.
@@ -2371,6 +2380,8 @@ CREATE TABLE IF NOT EXISTS omicron.public.tuf_artifact (
    PRIMARY KEY (name, version, kind)
);

CREATE INDEX IF NOT EXISTS tuf_artifact_sha256 ON omicron.public.tuf_artifact (sha256);
Suggested change:

CREATE UNIQUE INDEX IF NOT EXISTS tuf_artifact_sha256 ON omicron.public.tuf_artifact (sha256, id);
(pagination indexes should generally be unique)
This has the effect of adding a UNIQUE constraint to the sha256 column, which causes the bogus repos we generate during testing to fail to upload (because many of the bogus artifacts have the same sha256 checksum despite having different name/version/kind values). Should I instead be paginating on the composite primary key? Or is this one of the exceptions to the rule (and I should add a comment as to why)?
if let Some(presence) = ArtifactHash::from_str(&hash)
    .ok()
Ah, I got confused because we're swallowing the error when the hash doesn't parse. Would it be reasonable to:
let Ok(hash) = ArtifactHash::from_str(&hash) else {
    error!(log, "sled reported bogus artifact hash"; ...);
    continue;
};
if let Some(presence) = artifacts.get_mut(&hash) {
    ...
}
        presence.counts.insert(sled.id, count);
        presence.sleds.push(sled);
    } else {
        delete_requests.push(Request::Delete { sled, hash });
I'd suggest an info-level log message here.
We will already emit an info! or higher-severity message if this request occurs (at the bottom of the execute function). I think this would unnecessarily spam the logs if there are any artifacts slated for deletion.
// Mark all the artifacts found in `self.local` as locally available.
// If all of the artifacts in an `ArtifactsWithPlan` are sufficiently
// replicated, drop the `ArtifactsWithPlan`.
let sufficient_sleds = MIN_SLED_REPLICATION.min(sleds.len());
Is the min here just to handle developer systems that don't have MIN_SLED_REPLICATION sleds? If so, I'd put this into the configuration for this task (which is already distinct for the test suite / omicron-dev vs. a real system) so that we never accidentally consider this acceptable in a production system.
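A sketch of moving the floor into the task's config (field and type names assumed), so the lowered value lives in the test/dev config rather than in the code path a production system runs:

/// Hypothetical slice of the background task's config.
struct TufArtifactReplicationConfig {
    /// Set to the real minimum in production configs; the test suite and
    /// omicron-dev configs could explicitly set a smaller value.
    min_sled_replication: usize,
}

fn sufficient_sleds(cfg: &TufArtifactReplicationConfig) -> usize {
    // Note: no `.min(sleds.len())` clamp here. On a production system with
    // fewer sleds than the configured minimum, artifacts simply stay
    // "insufficiently replicated" instead of being silently considered safe.
    cfg.min_sled_replication
}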
// TODO: If there are any missing artifacts with no known copies,
// check the status of the assigned Nexus for its repositories. If the
// assigned Nexus is expunged, or if we are the assigned Nexus and we
// don't have the artifact, mark the repository as failed.
Can we file an issue for this? (I'd also strike the comment but that's less important.)
@@ -0,0 +1,447 @@
// This Source Code Form is subject to the terms of the Mozilla Public
I've got two comments here about testing and observability. These aren't blockers.
On testing, and feel free to ignore or defer this: there aren't any automated tests here, but I think it might be much easier to automatically test the logic here if it's separated into three phases: (1) collect data from the sled agents and the database, (2) determine what to do (this would produce the list of requests, and potentially also a status summary), (3) make the requests. All the tricky bits would be in (2) and you could write exhaustive tests for that (that have TUF repos come and go, sleds come and go, etc.) without needing to mess around with faking up database state or fake sled agents.
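To make that concrete, the split might look something like this, with invented names; phase 2 is a pure function, so it can be driven exhaustively by unit tests without a database or fake sled agents:

/// What we learned from CockroachDB and the sled agents (phase 1).
struct Inventory {
    // e.g., per-sled artifact counts, locally held repos, ...
}

/// A request we intend to make (phase 3 executes these over HTTP).
enum PlannedRequest {
    Put,           // upload a local artifact to a sled
    CopyFromDepot, // tell a sled to fetch from another sled
    Delete,        // remove an untracked artifact
}

/// Phase 2: pure planning logic -- no I/O, easy to test exhaustively.
fn plan(_inventory: &Inventory) -> Vec<PlannedRequest> {
    Vec::new() // the real decision logic would live here
}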
On observability: I'm coming at this like I come at all subsystems like this, which is: assuming at some point we do find a bug in production with it, how can we make sure we can debug it the first time it happens? The kinds of bugs I can imagine are (and these overlap a bunch):
- TUF repo doesn't seem to be replicating
- artifact is unexpectedly missing from some sled
- artifact took too long to be replicated
- artifact was unexpectedly deleted
- artifact is unexpectedly present on some sled
- artifact took too long to be deleted everywhere
- artifact was unexpectedly re-replicated
I think a few things would help us root-cause any of these problems from the first time they happen:
- "info"-level log message when a local repo is created or deleted
- "info"-level log entry for every download/delete request that we make to the sled
- maybe? "info"-level log entry with each activation listing the TUF repos (or artifacts?) that are believed to be "present"
- cumulative counters for the stuff we have last-activation counters for, plus one for "repos that we've received and then successfully replicated" and "repos that we've received and then permanently failed to replicate".
It would also be really nice to have a ringbuffer of the last N replicate/delete requests made to sleds. If we make this big enough (say, 32 sleds * 4 repos), we could conceivably have the full history for a particular TUF repo right there in the ringbuffer. This is more work than the others and redundant with the log entries, but much easier to debug with... if Nexus hasn't restarted.
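For illustration, a sketch of such a ringbuffer (capacity and entry type are placeholders):

use std::collections::VecDeque;

/// Fixed-capacity history of recent replicate/delete requests, for
/// debugging from a running Nexus.
struct RequestHistory<T> {
    buf: VecDeque<T>,
    capacity: usize,
}

impl<T> RequestHistory<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: VecDeque::with_capacity(capacity), capacity }
    }

    fn record(&mut self, entry: T) {
        // Drop the oldest entry once full.
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
        }
        self.buf.push_back(entry);
    }
}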
I'd suggest doing at least the "info"-level log entries (some of which may already exist). Recent customer discussions have emphasized the criticality of being able to debug problems without an Oxide person having direct access to the system so I think it's better for us to be proactive rather than wait to discover a bug here.
This mostly rounds out the implementation of uploading TUF repos to the rack. The module doc comment describes the flow:
omicron/nexus/src/app/background/tasks/tuf_artifact_replication.rs
Lines 7 to 46 in ba36c79
There's an integration test, but I eventually want to add a live test too, to test the behavior of MIN_SLED_REPLICATION. I am manually testing that behavior now.

(Previous, out-of-date text about tests:) There are no tests yet; I've been weighing my options, although I think it's unlikely any will be ready before I head out on Tuesday for a week. A live test could exercise MIN_SLED_REPLICATION, but I will need to teach the test context to resolve the external Nexus API in order to upload the repository.

After this lands, the remaining work is: system_update_put_repository and Sled Agent artifact_put.